Using calculus, humanity has designed safer roads, launched rockets into space, and tracked the motion of the planets and the spread of pandemics. In the AI era, calculus also gives us a new power: to create machines that think (or at least are very good at predicting the next word in a sentence). All of this comes from the simple ability to determine the slope of a function.
AI, at its core, is about making predictions. It might feel like you are asking AI a question and it magically retrieves an answer. But AI is in fact just a big prediction engine. It’s making a prediction (or best guess) of what the answer might be based on all the data it has access to and how it’s been trained.
When you ask an AI model to write you a paragraph, what it is actually doing is predicting the next token (a chunk of a word, usually just a few letters) and then stringing them all together. LLMs (Large Language Models) work by just predicting what small part comes next and doing this over and over.
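The next-token loop can be sketched in a few lines of Python. The tiny lookup table here is a made-up stand-in for a real model, which would score every token in its vocabulary with a neural network rather than a hand-written dictionary:

```python
# Toy sketch of next-token prediction. The table is invented for
# illustration; a real LLM predicts sub-word tokens, not whole words.
NEXT_TOKEN = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def generate(prompt, max_tokens=3):
    tokens = prompt.split()
    for _ in range(max_tokens):
        last = tokens[-1]
        if last not in NEXT_TOKEN:
            break
        # Predict the most likely next token, append it, and repeat.
        tokens.append(NEXT_TOKEN[last])
    return " ".join(tokens)

print(generate("the"))  # "the cat sat down"
```

The only thing that changes in a real model is how the next token is chosen; the outer loop of "predict, append, repeat" is the same.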
With enough data, this allows for predicting complex outputs that approach genuine reasoning. How far can it go? We still don’t know.
For AI to happen, you need three things: a model, data, and training. The model is the prediction machine itself. Data comes in two parts: examples to give to the model, and the answers that the model's predictions can be compared against.
Think of training a model like tuning a string on a guitar. You strum the string and listen to the sound it produces. Then you tighten it or loosen it to get it closer to the sound you expect. If it’s too high, you decrease the pitch. If it’s too low, you increase the pitch. Eventually, you find a pitch that is just right.
To train a model, what we’re tuning isn’t pitch, but an enormous number of small values called parameters. Each parameter subtly changes how the model behaves, like how each tuning peg slightly adjusts how the guitar will sound. We tune these parameters until the predictions match what we expect.
Training is a five-step process:

1. Start with a model whose parameters are set arbitrarily.
2. Feed the model example data and let it make a prediction.
3. Compare the prediction against the known answer.
4. Adjust the parameters to make the prediction closer.
5. Repeat until the predictions match what we expect.
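The five-step loop maps directly onto the guitar analogy. Here is a sketch of it as code, with a single "parameter" (the pitch of the string); all the numbers are illustrative:

```python
# Sketch of the training loop using the guitar-string analogy:
# predict, measure the error, adjust, repeat.

target = 440.0   # the pitch we want (setting up the goal)
pitch = 300.0    # the untuned string: an arbitrary starting guess
k = 0.1          # how aggressively we retune on each pass

for _ in range(100):
    error = target - pitch   # play the note and compare it to the goal
    pitch += k * error       # tighten or loosen the peg accordingly
                             # ...and repeat until it sounds right

print(round(pitch, 2))  # very close to 440.0
```

Each pass shrinks the remaining error, so the pitch homes in on the target. With a real model, the hard part is knowing which way to turn each of the millions of pegs, which is where calculus comes in.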
Great! So how do we actually do that? The most challenging step is step four. On a guitar, it’s relatively easy: Tighter is higher and looser is lower. But with AI models, changing any one parameter affects the model in ways we can’t easily predict.
So how do we do it?
We use calculus. Specifically, gradients, which describe how each parameter influences the accuracy of the model. We'll show you the specifics one step at a time.
Linear regression uses the same mathematical principles as modern AI, but with just two parameters we can actually see and understand. The premise is simple: given an x value, predict the y value. For example, you might set up a linear regression model to predict a house's price given its square footage. We have data in the form of (x, y) points we want to fit. Our task is to find two parameters, $m$ and $b$, for the line $y=mx+b$ that matches those points best. That is, for each data point $(x_n, y_n)$, we want $mx_n+b≈y_n$.
But how do we find $m$ and $b$?
Try sliding $m$ and $b$ around to match the line with the data.
The first step is to define, numerically, what counts as a good match between the line and the data points. In other words, a measure of success. This is called a loss function. For each data point, we want to punish predictions that are far off and reward close ones. One formula you could imagine is $L=(y_{real}-y_{pred})^2$. Subtracting the predicted value from the real value makes sense: a larger difference is worse, and that corresponds to a higher loss. Why squared? Three reasons: (1) it makes negative differences positive, (2) it punishes big mistakes much more than small ones, and (3) squared functions have nice mathematical properties for finding minimums. So for some choice of $m$ and $b$, you can calculate the loss $L$ for any given data point. To get an overall loss, you sum the individual errors and average them. This is called MSE, for Mean Squared Error, and it is one of the most common loss functions in machine learning. Here’s the formula:
$\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{real,i}-y_{pred,i}\right)^{2}$
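MSE is only a few lines of code. Here is a minimal sketch; the data points are made up so that they lie exactly on $y=2x+1$:

```python
# Mean Squared Error of the line y = m*x + b against a set of points.
def mse(m, b, points):
    total = 0.0
    for x, y_real in points:
        y_pred = m * x + b             # the line's guess at this x
        total += (y_real - y_pred) ** 2
    return total / len(points)         # average the squared errors

# Illustrative data lying exactly on y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5)]
print(mse(2, 1, data))  # a perfect fit gives a loss of 0.0
print(mse(1, 0, data))  # a worse line gives a higher loss
```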
Try sliding around $m$ and $b$, with the goal of minimizing $L$.
Here's the key insight: instead of guessing $m$ and $b$ randomly, we can treat this as a math problem. We now have a function $L(m,b)$ that tells us how wrong our line is for any choice of $m$ and $b$. We want to find the $m$ and $b$ that minimize the output. Now the really cool part: because there are only two parameters, we can actually graph $m$, $b$, and $L(m,b)$ in 3D. Guessing numbers until the line matches the data is unscientific, but minimizing a function's output by changing its inputs has been thoroughly studied and can be approached mathematically. Plotting the loss over the parameters produces what is called a loss landscape.
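One way to see the loss landscape without a 3D plot is to evaluate $L(m,b)$ over a grid of candidate values and keep the pair with the smallest loss. Brute force only works here because there are just two parameters; the data and grid ranges are illustrative:

```python
def mse(m, b, points):
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)

# Illustrative data lying exactly on y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5)]

# Scan a coarse grid of (m, b) pairs from -5.0 to 5.0 in steps of 0.1
# and keep the lowest point -- a brute-force walk over the landscape.
best = min(
    ((m / 10, b / 10) for m in range(-50, 51) for b in range(-50, 51)),
    key=lambda p: mse(p[0], p[1], data),
)
print(best)  # (2.0, 1.0)
```

This grid scan already takes about ten thousand evaluations for two parameters; with millions of parameters it becomes impossible, which is why we need something smarter.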
Drag around the 3D model on the right and explore the descriptions and equations on the left.
These same principles work even if there are millions of parameters. But we have to be clever. When there are so many parameters, we can't just plot all the possible options and look for the smallest one. Let's start simple with just one parameter to see how calculus helps us find the minimum. We have a normal function $y=f(x)$, and we want to find the lowest point.
Try dragging the orange point along the invisible curve. Click the black text to reveal the function once you think you've found the lowest point.
The derivative of a function gives us the instantaneous slope at any point. With that, we can construct the tangent line. Try dragging $a$ around and watch what happens to the tangent line (and the value of $f'(a)$) on either side of the lowest point of the function.
Try dragging the orange point along the invisible curve, paying attention to the tangent line. Click the black text to reveal the function once you think you've found the lowest point. The tangent line should be horizontal at the lowest point, because the derivative is zero.
The mathematical reason behind this is that the derivative is zero at local extrema: the function switches between going down and going up, so the derivative switches from a negative number to a positive number, passing through zero in the middle.
On the left of the lowest point, the derivative is negative. On the right, it's positive. In either case, the curve is steeper the further away you are. You might discover a strategy: if the tangent line points down and to the right, go right. If it points down and to the left, go left. If it's mostly flat, you're almost there. If the slope is very steep, take a bigger step. Almost like tuning a guitar to find the right pitch.
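That strategy, stepping opposite the slope with bigger steps where it's steeper, is gradient descent in one dimension. Here is a sketch on $f(x)=(x-3)^2$; the function, starting point, and step size are all chosen for illustration:

```python
def f_prime(x):
    # Derivative of f(x) = (x - 3)**2, whose lowest point is at x = 3.
    return 2 * (x - 3)

x = 10.0   # arbitrary starting point
k = 0.1    # step-size coefficient

for _ in range(100):
    # Move opposite the slope; a steeper slope means a bigger step.
    x -= k * f_prime(x)

print(round(x, 3))  # close to 3.0
```

Notice that the step size shrinks automatically as the point nears the bottom, because the slope itself shrinks there.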
Click the arrow symbol on the left to advance the point, following the algorithm outlined. As you progress, how does our point appear to move? How does $f'(a)$ behave?
Now, in 3D (or any number of dimensions), the method is almost the same. In 2D, the strategy was to find the tangent line and move in the direction opposite its slope. The same applies in 3D, except we take the derivative with respect to each parameter separately, since there are multiple dimensions.
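In code, the only change from the one-dimensional version is one partial derivative per parameter, with each parameter stepped opposite its own slope. A sketch on the bowl-shaped $f(a,b)=a^2+b^2$, chosen for illustration because its lowest point is at the origin:

```python
def grad(a, b):
    # Partial derivatives of f(a, b) = a**2 + b**2.
    return 2 * a, 2 * b

a, b = 4.0, -3.0   # arbitrary starting point on the surface
k = 0.1            # step-size coefficient

for _ in range(200):
    da, db = grad(a, b)
    a -= k * da    # step each parameter opposite its own slope
    b -= k * db

print(round(a, 4), round(b, 4))  # both close to 0.0
```

With a million parameters, the loop body has a million such updates, but the idea is identical.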
Click the arrow on the left of the interactive and watch the point descend down the green surface over time.
When we find the lowest point on that 3D surface, we've found the best values for $m$ and $b$ - the ones that make our line fit the data points as closely as possible. Play around on the final graph:
Try clicking the arrow next to Update several times and see how the red line behaves. Click the arrow next to Reset to reset the interactive.
A note on learning rates. You can't set the learning-rate coefficient $k$ too high, because the update would overcorrect, then overcorrect again, until $m_1$ and $b_1$ spiral out of control. But you can't set it too low either: training becomes slow, and tiny steps make it easy to get stuck. Picture a landscape with many small valleys: if your steps are too small, you might settle into a shallow valley and never find the deepest one. This is called getting trapped in a local minimum.
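Putting it all together for our line: the update step uses the partial derivatives of MSE with respect to $m$ and $b$, scaled by the learning rate $k$. The data, starting values, and $k$ below are illustrative:

```python
# Illustrative data lying exactly on y = 2x + 1.
data = [(0, 1), (1, 3), (2, 5)]
m, b = 0.0, 0.0   # start with an arbitrary line
k = 0.1           # learning rate: not too high, not too low

for _ in range(2000):
    n = len(data)
    # Partial derivatives of MSE with respect to m and b.
    dm = (2 / n) * sum((m * x + b - y) * x for x, y in data)
    db = (2 / n) * sum((m * x + b - y) for x, y in data)
    m -= k * dm   # step both parameters opposite their slopes
    b -= k * db

print(round(m, 3), round(b, 3))  # close to 2.0 and 1.0
```

This is the whole of linear regression by gradient descent; scaling the same loop up to millions of parameters and a neural network in place of $mx+b$ gives you model training as practiced today.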
So that is what 'training' really means: mathematical optimization of parameters for accurate predictions. Here’s an expanded sketch of what our five-step training process looks like with our newfound understanding:

1. Start with a model whose parameters are set arbitrarily.
2. Feed the model example data and let it make predictions.
3. Compare the predictions against the known answers using a loss function like MSE.
4. Compute the gradient of the loss with respect to each parameter, and step each parameter in the opposite direction.
5. Repeat until the loss stops shrinking.
All this process is doing is trying to find the lowest point of the loss landscape in the most efficient way possible. And in minimizing the loss, the model is able to form good predictions. With this newly-minted prediction machine, the hope is that it will be able to generalize further and correctly predict things it hasn’t seen before.
Appendix
Looking for more? Here is some suggested reading from outside sources.
© 2025 onward Jacob Buckhouse. All rights reserved.